- Intro & general tips
- R performance tips: patterns to use and avoid
- Vectors & Matrices
- Memory management, large tables
- Profiling and Benchmarking
- Loops
August 22, 2018
General tips:

- Use git to keep track of changes
- Use packages designed for large data and parallelism: data.table, bigmemory, dplyr, RSQLite, snow, multicore, parallel
- Monitor resource usage with top, Activity Monitor, or Task Manager
- Byte-compile functions with cmpfun, or call a compiled language (e.g. C, C++)

Many R operations are vectorized:

- Arithmetic operators: +, -, *, /, ^, %/%, %%
- Math functions: abs, sqrt, exp, log, log10, cos, sin, tan, sum, prod
- Logical operators: &, |, !
- Comparison operators: ==, !=, <, >, <=, >=
- String functions: nchar, tolower, toupper, grep, sub, gsub, strsplit
- ifelse (pure R code)
- Others: which, which.min, which.max, pmax, pmin, is.na, any, all, rnorm, runif, sprintf, rev, paste, as.integer, as.character

Create a vector x and fill it with zeros:

n <- 10
x <- double(n)
x
[1] 0 0 0 0 0 0 0 0 0 0
Extend x by assignment:

x[15] <- 100
x
[1] 0 0 0 0 0 0 0 0 0 0 NA NA NA NA 100
Truncate x by setting its length:

length(x) <- 5
x
[1] 0 0 0 0 0
x <- rnorm(10)
x
[1] -0.3482306 -1.2933403  0.1313107 -0.1566137 -0.1810409 -0.9515630
[7] -0.1438242 -2.2105651  1.1689162 -0.6161509
x[3:6]
[1] 0.1313107 -0.1566137 -0.1810409 -0.9515630
x[x > 0]
[1] 0.1313107 1.1689162
x[is.na(x)] <- 0
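Indexing like this often replaces an explicit loop entirely. A small sketch (the data are illustrative) that zeroes out negative values both ways:

```r
a <- c(-2, 5, -1, 7)
# loop version
for (i in seq_along(a)) {
  if (a[i] < 0) a[i] <- 0
}

# vectorized version: same result, typically much faster
b <- c(-2, 5, -1, 7)
b[b < 0] <- 0

identical(a, b)   # TRUE
```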
m <- matrix(rnorm(100), 10, 10)
m[3:4, c(5,7,9)]
           [,1]       [,2]      [,3]
[1,]  2.0858869 -1.0977206  2.630878
[2,] -0.8048464 -0.8215921 -0.971008
m[cbind(3:6, c(2,4,6,9))]
[1] 0.78458919 -0.47864075 0.02153362 -1.13464444
head(m[m > 0])
[1] 0.43593849 0.12611507 0.10813142 0.04578266 0.78458919 0.82060834
m[is.na(m)] <- 0
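Vectorized matrix operations can replace whole nested loops. A small sketch (sweep is one of several ways to do this; the data are illustrative) that centers the columns of a matrix both ways:

```r
set.seed(1)
m <- matrix(rnorm(20), 4, 5)

# loop version: subtract each column's mean
out1 <- m
cm <- colMeans(m)
for (j in 1:ncol(m)) {
  out1[, j] <- m[, j] - cm[j]
}

# vectorized version with sweep (default FUN is "-")
out2 <- sweep(m, 2, colMeans(m))

all.equal(out1, out2)   # TRUE
```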
tracemem reports when an object is duplicated, which is very useful for debugging performance problems.
In this example the duplication is expected: modifying y must not change x.
x <- double(10)
tracemem(x)
[1] "<0x55b01443a330>"
y <- x
y[1] <- 10
tracemem[0x55b01443a330 -> 0x55b014441338]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
.Internal(inspect(x))
@55b01443a330 14 REALSXP g0c5 [NAM(2),TR] (len=10, tl=0) 0,0,0,0,0,...
.Internal(inspect(y))
@55b014441338 14 REALSXP g0c5 [NAM(1),TR] (len=10, tl=0) 10,0,0,0,0,...
x[1] <- 50
tracemem[0x55b01443a330 -> 0x55b014458a48]: eval eval withVisible withCallingHandlers handle timing_fn evaluate_call <Anonymous> evaluate in_dir block_exec call_block process_group.block process_group withCallingHandlers process_file <Anonymous> <Anonymous>
Even an apparently read-only call like nrow can mark an object for duplication, so the next modification triggers a copy:

> m <- matrix(0, 3, 3)
> tracemem(m)
[1] "<0x7fc168d29df0>"
> m[1,1] <- 1
> nrow(m)
[1] 3
> m[1,1] <- 2
tracemem[0x7fc168d29df0 -> 0x7fc168d21f58]:
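Copies also happen when a function modifies its argument; tracemem makes them visible. A sketch (requires an R build with memory profiling enabled, as the CRAN binaries are; exact tracemem output varies by session):

```r
f <- function(v) {
  v[1] <- 0   # modifying the argument duplicates the caller's vector
  v
}

x <- c(3, 2, 1)
tracemem(x)   # watch x for duplication
y <- f(x)     # tracemem reports a copy made inside f
x[1]          # x is unchanged; f worked on the duplicate
```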
R makes it easy to read an entire data set in one operation, but reading it in parts can be much more efficient.
- The foreach and iterators packages provide tools for splitting inputs into smaller pieces
- Consider using command line utilities (split, awk, etc.) or other languages (e.g. Python) to preprocess the data
The read.table function is commonly used for reading data files, but it can be very slow on large files.

- The colClasses argument can improve performance
- colClasses can also be used to skip a column, using less memory
- Specifying the nrows argument can also help
- The scan function can be faster

Several packages provide faster readers and structures for large tables:

- readr: faster replacements for read.table and friends
- data.table: fread returns a data.table, similar to a data.frame but without row labels
- bigmemory: read.big.matrix creates a big.matrix; the biganalytics package adds apply, biglm, bigglm, bigkmeans, colmax

Saving data in a binary format can make it much faster to read the data later. There are a variety of functions available to do that:
- save/load
- writeBin/readBin
- write.big.matrix/read.big.matrix (from bigmemory)

Consider putting data into an SQLite database.
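A minimal sketch of the SQLite route, assuming the DBI and RSQLite packages are installed (the table name and query are illustrative):

```r
library(DBI)

# in-memory database for illustration; use a file path for real data
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "mtcars", mtcars)   # load a data frame as a table

# pull back only the rows you need, instead of the whole table
heavy <- dbGetQuery(con, "SELECT * FROM mtcars WHERE wt > 3.5")
dbDisconnect(con)

nrow(heavy)
```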
- The RSQLite package is easy to use
- The sqldf package may also be useful

See also, on using a faster BLAS with R:

- http://brettklamer.com/diversions/statistical/faster-blas-in-r/
- https://cran.r-project.org/doc/manuals/r-release/R-admin.html#BLAS
The Intel Math Kernel Library (MKL) is also fast, but trickier to get working from scratch.
MacOS
brew install openblas
brew install r --with-openblas
Linux (Ubuntu)
sudo apt-get install libopenblas-base
My machine: Ubuntu Bionic, i9-8950HK CPU, R 3.4.4, OpenBLAS 0.2.20
15 tests from http://r.research.att.com/benchmarks/R-benchmark-25.R
Default R: ~29 Seconds
R with OpenBLAS: ~3 seconds
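A quick way to check whether an optimized BLAS is active is to time a large matrix multiply, which is dominated by the BLAS dgemm routine (the size here is illustrative):

```r
set.seed(1)
n <- 1000
a <- matrix(rnorm(n * n), n, n)

# with a tuned BLAS (OpenBLAS, MKL) this runs many times
# faster than with the reference BLAS shipped with R
system.time(b <- a %*% a)
```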
R has built-in support for profiling, but there are additional packages available:

- proftools
- profvis (RStudio integration)

Some functions to profile:

f <- function(a) { g1(a) + g2(2 * a) }
g1 <- function(a) { h(a) }
g2 <- function(a) { sqrt(a) }
h <- function(a) {
b <- double(length(a))
for (i in seq_along(a)) {
b[i] <- sqrt(a[i])
}
b
}
x <- 1:1000000
Rprof('prof.out')
for (i in 1:10) {
y <- f(x)
}
Rprof(NULL)
summaryRprof("prof.out")$by.self
         self.time self.pct total.time total.pct
"h"           0.44    50.00       0.46     52.27
"g2"          0.34    38.64       0.40     45.45
"sqrt"        0.04     4.55       0.04      4.55
"f"           0.02     2.27       0.88    100.00
"*"           0.02     2.27       0.02      2.27
"double"      0.02     2.27       0.02      2.27
The profvis package gives an interactive view of the same information:

profvis({
for (i in 1:10) {
y <- f(x)
}
})
Once profiling shows where the code is slow, use benchmarking tools to compare alternative implementations:

- system.time is useful for long running code
- The microbenchmark package is useful for analyzing short running code

Task: create a vector of length n where every value is x.
bad.for <- function(n,x) {
result <- NULL
for (i in 1:n) {
result[i] <- x
}
result
}
okay.for <- function(n,x) {
result <- double(n)
for (i in 1:n) {
result[i] <- x
}
result
}
Improvement over the previous example, but it's still slow because of the many tiny iterations.
strange.for <- function(n, x) {
result <- NULL
for (i in n:1) {
result[i] <- x
}
result
}
Is this loop faster or slower than the previous two?
# use of vector assignment
vector.assn <- function(n, x) {
result <- double(n)
result[] <- x
result
}
We can also use vector assignment
built.in <- function(n, x) {
rep(x, n)
}
Or, we could read the fine manual and use a built-in function
n <- 10000
x <- 7
bad.result <- bad.for(n, x)
okay.result <- okay.for(n, x)
strange.result <- strange.for(n, x)
vector.result <- vector.assn(n, x)
built.result <- built.in(n, x)
c(identical(bad.result, okay.result),
  identical(bad.result, strange.result),
  identical(bad.result, vector.result),
  identical(bad.result, built.result))
[1] TRUE TRUE TRUE TRUE
res <- microbenchmark(bad=bad.for(n,x), okay=okay.for(n,x), strange=strange.for(n,x),
vector=vector.assn(n,x), builtin=built.in(n,x))
kable(summary(res, unit="relative"))
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| bad | 191.154555 | 183.385372 | 118.896550 | 191.921492 | 192.342065 | 8.623180 | 100 |
| okay | 39.014038 | 35.618441 | 20.695163 | 34.849552 | 34.034481 | 1.214267 | 100 |
| strange | 37.584503 | 33.782186 | 19.773947 | 33.462746 | 33.079933 | 1.187464 | 100 |
| vector | 1.763006 | 1.661315 | 1.800682 | 1.653194 | 1.627932 | 2.434292 | 100 |
| builtin | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 100 |
autoplot(res)
Task: generate a matrix with n rows and x columns, where each value is sampled from the normal distribution (\(\mu=0\), \(\sigma=1\)).
bad.norm <- function(n,x) {
m <- NULL
for (i in 1:n) {
m <- rbind(m, rnorm(x))
}
m
}
Just like before, we grow the matrix inside a for loop (here with rbind).
ok.norm <- function(n,x) {
m <- matrix(0, nrow=n, ncol=x)
for (i in 1:n) {
m[i,] <- rnorm(x)
}
m
}
lapply.norm <- function(n,x) {
do.call('rbind', lapply(1:n, function(i) rnorm(x)))
}
best.norm <- function(n,x) {
m <- rnorm(x * n)
dim(m) <- c(x, n)
t(m)
}
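best.norm works because R stores matrices column-major: filling an x-by-n matrix column by column consumes the random numbers in the same order as generating each row separately, and the transpose restores the row layout. A tiny check (sizes are illustrative):

```r
set.seed(42)
a <- t(matrix(rnorm(6), 2, 3))   # 6 draws filled as 3 columns of 2, then transposed

set.seed(42)
b <- do.call(rbind, lapply(1:3, function(i) rnorm(2)))   # 3 rows of 2 draws

identical(a, b)   # TRUE
```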
n <- 600
x <- 100
# Verify correct results
set.seed(123); bad.result <- bad.norm(n,x)
set.seed(123); ok.result <- ok.norm(n,x)
set.seed(123); lapply.result <- lapply.norm(n,x)
set.seed(123); best.result <- best.norm(n,x)
c(identical(bad.result, ok.result),
  identical(bad.result, lapply.result),
  identical(bad.result, best.result))
[1] TRUE TRUE TRUE
res <- microbenchmark(bad=bad.norm(n,x), ok=ok.norm(n,x),
lapply=lapply.norm(n,x), best=best.norm(n,x))
kable(summary(res, unit="relative"))
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| bad | 10.644155 | 10.925157 | 11.400705 | 10.994154 | 11.181110 | 15.119874 | 100 |
| ok | 1.315940 | 1.336382 | 1.383705 | 1.337658 | 1.357326 | 1.646225 | 100 |
| lapply | 1.367142 | 1.382912 | 1.424668 | 1.374325 | 1.370746 | 1.510307 | 100 |
| best | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 100 |
autoplot(res)
Disable the JIT compiler (from the compiler package) so the plain function can be compared against an explicitly compiled one:

library(compiler)
enableJIT(0)
[1] 3
fun.for <- function(x, seed=1423) {
set.seed(seed)
y <- double(length(x))
for (i in seq_along(x)) {
y[i] <- rnorm(1) * x[i]
}
y
}
fun.for.compiled <- cmpfun(fun.for)
x <- rnorm(10000)  # a vector of inputs; with a scalar the loop would run only once
res <- microbenchmark(fun.for=fun.for(x),
fun.for.compiled=fun.for.compiled(x))
kable(summary(res, unit="relative"))
| expr | min | lq | mean | median | uq | max | neval |
|---|---|---|---|---|---|---|---|
| fun.for | 1.138554 | 1.145816 | 1.430793 | 1.141394 | 1.142783 | 7.445074 | 100 |
| fun.for.compiled | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 100 |
autoplot(res)